library(fpp3)
## ── Attaching packages ────────────────────────────────────────────── fpp3 0.5 ──
## ✔ tibble      3.1.8     ✔ tsibble     1.1.3
## ✔ dplyr       1.1.0     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.0     ✔ feasts      0.3.0
## ✔ lubridate   1.9.2     ✔ fable       0.3.2
## ✔ ggplot2     3.4.1     ✔ fabletools  0.3.2
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()
library(tsibble)

2.1 tsibble objects

The index variable

y <- tsibble(
  Year = 2015:2019,
  Observation = c(123, 39, 78, 52, 110),
  index = Year
)
y

tsibble introduces temporal structure to tidy data frames.

A regular tibble can be converted to a tsibble object

z <- tibble(
  Month = c("2019 Jan", "2019 Feb", "2019 Mar", "2019 Apr", "2019 May"),
  Observation = c(50, 23, 34, 30, 25)
)
z
z |>
  mutate(Month = yearmonth(Month)) |>
  as_tsibble(index = Month)

mutate() converts the “Month” column to the appropriate variable type, in this case yearmonth. as_tsibble() sets the index for the tsibble object

Frequency Function
Annual start:end
Quarterly yearquarter()
Monthly yearmonth()
Weekly yearweek()
Daily as_date(), ymd()
Sub-Daily as_datetime(), ymd_hms()

The key variables

olympic_running

The summary shows that the tsibble object has 312 rows, 4 columns, and the index is in 4 year intervals. Additionally there are 14 seperate time series uniquely identified by the keys Length and Sex. distinct() can be used to show categories and combinations of each variable.

olympic_running |> distinct(Sex)
olympic_running |> distinct(Length)

Working with tsibble objects

dplyr functions include mutate(),filter(),select(),summarise()`

PBS

This data set contains monthly data on Medicare Australia prescription data from July 1991 to June 2008.

select()

PBS |>
  filter(ATC2 == "A10")

filter()

PBS |>
  filter(ATC2 == "A10") |>
  select(Month, Concession, Type, Cost)

Note that Month, Concession and Type would be returned automatically to ensure that each row contains a unique combination of index and keys.

summarise()

PBS |>
  filter(ATC2 == "A10") |>
  select(Month, Concession, Type, Cost) |>
  summarise(TotalC = sum(Cost))

mutate()

Used to create new variables

PBS |>
  filter(ATC2 == "A10") |>
  select(Month, Concession, Type, Cost) |>
  summarise(TotalC = sum(Cost)) |>
  mutate(Cost = TotalC/1e6) # Convert from dollars to millions of dollars

Save the tsibble object

PBS |>
  filter(ATC2 == "A10") |>
  select(Month, Concession, Type, Cost) |>
  summarise(TotalC = sum(Cost)) |>
  mutate(Cost = TotalC/1e6) -> a10

Read a csv file and convert to a tsibble

prison <- readr::read_csv("https://OTexts.com/fpp3/extrafiles/prison_population.csv")
## Rows: 3072 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): State, Gender, Legal, Indigenous
## dbl  (1): Count
## date (1): Date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
prison

When converting the table to a tsibble object we define the index and key columns. Note that the data is quarterly but is stored as individual days.

prison <- prison |>
  mutate(Quarter = yearquarter(Date)) |>
  select(-Date) |>
  as_tsibble(key = c(State, Gender, Legal, Indigenous),
             index = Quarter)

prison

The seasonal period

Common periods for time intervals

Data Minute Hour Day Week Year
Quarters 4
Months 12
Weeks 52
Days 7 365.25
Hours 24 168 8766
Minutes 60 1440 10080 525960
Seconds 60 3600 86400 6048000 31557600

More complicated and unusual seasonal patterns can be specified using period() in the lubricate package.

2.2 Time plots

ansett

Let’s look at the weekly economy passenger load on Ansett Airlines between their two largest cities.

melsyd_economy <- ansett |>
  filter(Airports == "MEL-SYD", Class == "Economy") |>
  mutate(Passengers = Passengers/1000)
autoplot(melsyd_economy, Passengers) +
  labs(title = "Ansett airlines economy class",
       subtitle = "Melbourne-Sydney",
       y = "Passengers (,000)")

Note:

A model needs to take all these features into account.

Looking at our simpler time series from 2.1:

PBS |>
  filter(ATC2 == "A10") |>
  select(Month, Concession, Type, Cost) |>
  summarise(TotalC = sum(Cost)) |>
  mutate(Cost = TotalC/1e6) -> a10
autoplot(a10, Cost) +
  labs(y = "$ (millions)",
       title = "Australian antidiabetic drug sales")

This shows a clear and increasing trend.

2.3 Time series patterns

  1. The upper left image shows strong seasonality within each year and cyclic behaviour with a period between 6-10 year.
  2. The upper right shows no seasonality but a clear downward trend.
  3. The bottom left shows both seasonality and trend but no apparent cyclical nature
  4. The bottom left shows neither trend, seasonality or cyclic aspects.

2.4 Seasonal plots

a10 |>
  gg_season(Cost, labels = "both") +
  labs(y = "$ (millions)",
       title = "Seasonal plot: Antidiabetic drug sales")

Useful for identifying Years with with unusual patterns. In this case we see 2008 dipping lower than usual in March 2008. The low figure in June could be the result of incomplete data.

Multiple seasonal periods

period can be used for data with multiple seasonal patterns by specifying daily, weekly or yearly patterns

vic_elec
vic_elec |> gg_season(Demand, period = "day") +
  theme(legend.position = "none") +
  labs(y = "Mwh", title = "Electricity demain in Victoria")

vic_elec |> gg_season(Demand, period = "week") +
  theme(legend.position = "none") +
  labs(y = "Mwh", title = "Electricity demain in Victoria")

vic_elec |> gg_season(Demand, period = "year") +
  theme(legend.position = "none") +
  labs(y = "Mwh", title = "Electricity demain in Victoria")

2.5 Seasonal subseries plots

a10 |>
  gg_subseries(Cost) +
  labs(
    y = "$ (million)",
    title = "Australian antidibetic drug sales"
  )

This plot collects each season into seperate mini time plots. The blue line indicates the mean. This plot is useful to clearly view the underlying seasonal pattern and shows the changes in seasonality over time. In this case it is not very interesting.

Example: Australian holiday tourism

tourism

Lets consider the total visitor nights on holiday for each quarter.

holidays <- tourism |>
  filter(Purpose == "Holiday") |>
  group_by(State) |>
  summarise(Trips = sum(Trips))
holidays
autoplot(holidays, Trips) +
  labs(y = "Overnight trips",
       title = "Australian domestic holidays")

The plot shows strong seasonality for most states but the peaks do not necessarily coincide.To see the timing of those seasonal peaks use a gg_season() plot.

gg_season(holidays, Trips) +
  labs(y = "Overnight trips ('000)",
       title = "Austrlian domestic holidays")

And a gg_subseries() plot

gg_subseries(holidays, Trips) +
  labs(y = "Overnight trips ('000)",
       title = "Austrlian domestic holidays")

Here we see the increase in Western Australia in recent years. Also Victoria’s strong increas in Q1 and Q4 but not in the other quarters.

2.6 Scatterplots

The above plots are used to explore individual time series. Scatterplots can explore the relationship between two time series.

The following shows half-hourly electricity demand and temperature for 2014 in Victoria Australia.

vic_elec |>
  filter(year(Time) == 2014) |>
  autoplot(Demand) +
  labs(y = "GW",
       title = "Half-hourly electricity demand: Victoria")

vic_elec |>
  filter(year(Time) == 2014) |>
  autoplot(Temperature) +
  labs(
    y = "Degrees Celsius",
    title = "Half-hourly temperatures: Melbourne, Australia"
  )

vic_elec |>
  filter(year(Time) == 2014) |>
  ggplot(aes(x = Temperature, y = Demand)) +
  geom_point() +
  labs(x = "Temperature (degrees Celsius)",
       y = "Electricity demand")

This shows a high demand when temperatures are high as well as when they are low.

Correlation

The correlation coefficients measure the strength of the linear relationship between two variables.

\[ r = \frac{\sum (x_{t} - \bar{x})(y_{t}-\bar{y})}{\sqrt{\sum(x_{t}-\bar{x})^2}\sqrt{\sum(y_{t}-\bar{y})^2}} \]

for \(r\) between -1 and 1.

The correlation coefficient only measures the strenght of the linear relationship.

These plots all have correlation coefficients of 0.82 but have very different relationships. Plots are important. Do not rely on correlation values alone.

Scatterplot matrices

A useful starting point is to plot each variable agains each other variable.

visitors <- tourism |>
  group_by(State) |>
  summarise(Trips = sum(Trips))
visitors |>
  ggplot(aes(x = Quarter, y = Trips)) +
  geom_line() +
  facet_grid(vars(State), scales = "free_y") +
  labs(title = "Australian domestic tourism",
       y= "Overnight trips ('000)")

visitors |>
  pivot_wider(values_from=Trips, names_from=State) |>
  GGally::ggpairs(columns = 2:9)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

2.7 Lag plots

recent_production <- aus_production |>
  filter(year(Quarter) >= 2000)
recent_production |>
  gg_lag(Beer, geom = "point") +
  labs(x = "lag(Beer, k)")

Note the strong positive relationship at lags 4 and 8 reflecting seasonality. The strong negative correlations in lags 2 and 6 reflect peaks in Q4 being plotted against troughs in Q2.

2.8 Autocorrelation

Autocorrelation measures the linear relationship between lagged values of a time series.

There are several autocorrelation coefficients corresponding to each panel in the lag plots. Eg. \(r_1\) measures the relationship between \(y_t\) and \(y_{t-1}\), \(r_2\) for \(y_t\) and \(y_{t-2}\), etc.

\[ r_{k} = \frac{\sum\limits_{t=k+1}^T (y_{t}-\bar{y})(y_{t-k}-\bar{y})} {\sum\limits_{t=1}^T (y_{t}-\bar{y})^2} \]

where \(T\) is the length of the time series. The autocorrelation function ACF() can be used to compute the coefficients.

recent_production |> ACF(Beer, lag_max = 9)

The acf column contains \(r_1, r_2,\dots,r_9\). We can plot these with a correlogram.

recent_production |>
  ACF(Beer) |>
  autoplot() + labs(title = "Australian beer production")

Trend and seasonality in ACF plots

For data with a trend there is strong autocorrelation for small lags because observations near in time are also near in value. Therefore the ACF will ususally show positive values which slowly decrease as the lags increase.

When data are seasonal the autocorrelation will be larger for seasonal lags (multiples of the seasonal period).

a10 is an example of a time series with both seasonal and trend components.

a10 |>
  ACF(Cost, lag_max = 48) |>
  autoplot() +
  labs(title = "Australian antidiabetic drug sales")

2.9 White noise

White noise refers to data with no autocorrelation

set.seed(30)
y <- tsibble(sample = 1:50, 
             wn = rnorm(50), 
             index = sample)
y |> autoplot(wn) + labs(title = "White noise", y = "")

y |>
  ACF(wn) |>
  autoplot() + labs(title = "White noise")

For white noise each autocorrelation value should be close to zero. More specifically 95% of the spikes should lie within \(\pm 2/\sqrt{T}\) where \(T\) is the length of the time series. If one or more large spikes or more than 5% of spikes lie outside the 95% bounds it is probably not pure white noise.